
Low latency RNN inference with cellular batching

https://doi.org/10.1145/3190508.3190541

Gao, Pin; Yu, Lingfan; Wu, Yongwei; Li, Jinyang (April 2018, EuroSys '18: Proceedings of the Thirteenth EuroSys Conference)

Inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this approach does not perform well when serving Recurrent Neural Networks (RNNs) with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN "cell" (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and also higher throughput than existing systems.
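
The cell-level scheduling idea described in the abstract can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not BatchMaker's implementation: the `Request` class, the `cellular_schedule` function, the `max_batch` parameter, and the unit-time cost of one batched cell are all hypothetical, and a counter increment stands in for the actual RNN cell computation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request: a sequence that needs `length` RNN cell steps."""
    rid: int
    arrival: int   # time step at which the request enters the system
    length: int    # number of cell invocations the sequence needs
    done: int = 0  # cell invocations completed so far


def cellular_schedule(requests, max_batch):
    """Toy scheduler: each time step runs one batched cell over up to
    `max_batch` in-flight requests. Returns {rid: finish_time}."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    active = deque()
    finish = {}
    t = 0
    while pending or active:
        # New requests join as soon as they arrive; they do not wait for
        # the sequences already in flight to finish.
        while pending and pending[0].arrival <= t:
            active.append(pending.popleft())
        # Assemble the batch for this step from whatever is in flight now.
        batch = [active.popleft() for _ in range(min(max_batch, len(active)))]
        for r in batch:
            r.done += 1  # stands in for one shared-weight cell execution
            if r.done == r.length:
                finish[r.rid] = t + 1  # request leaves the system
            else:
                active.append(r)       # still needs more cell steps
        t += 1
    return finish


if __name__ == "__main__":
    reqs = [Request(rid=0, arrival=0, length=5),
            Request(rid=1, arrival=1, length=2)]
    print(cellular_schedule(reqs, max_batch=2))
```

In this example the short request arrives one step after the long one and finishes at time 3, while the long request is still running. Under the same unit-cost assumption, a scheduler that batches whole sequences would make the short request wait for the long one to complete before starting.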

